fix(lifecycle): drain StreamManager goroutines in tests #1227

Merged
jcfs merged 3 commits into main from feature/fix-lifecycle-goleak-xdu
Jun 3, 2026
Conversation

@jcfs
Contributor

@jcfs jcfs commented Jun 1, 2026

Summary

  • Fixes CI-only flake where the internal/agent/runtime/lifecycle goleak.VerifyTestMain would fail on slow runners with leaked StreamManager.connectWorkspaceStream + WorkspaceStream.writeLoop / read loop goroutines, even though every individual test passed (see PR feat: full GitLab integration — parity with GitHub #1120 run 26745440163 job 78819616867).
  • Makes agentctl.Client.Close() an absolute drain barrier — tracks every stream goroutine it spawns and waits for them, plus a closed flag that rejects new StreamUpdates / StreamWorkspace dials so a Close racing an in-flight dial cannot strand a fresh WS connection. The restart path in manager_interaction.go switches to per-stream Close helpers so it can keep reusing the client after teardown.
  • Makes StreamManager.Wait() an absolute drain barrier — closes a new internal waitCh that the retry backoff and the connected <-ws.Done() / <-stop> select observe, so drain doesn't depend on the caller closing the external stopCh first. streamContext carries both stops through stopChannelContext so in-flight WS dials cancel on either signal.
  • Local hardware doesn't reliably reproduce the leak. Added a make test-lifecycle-goleak LIFECYCLE_GOLEAK_COUNT=N target (defaults to 20) as the repro hook for the flake, plus a TestClientClose_DrainsStreamGoroutines regression test in the agentctl package.

Test plan

  • make test-lifecycle-goleak LIFECYCLE_GOLEAK_COUNT=20 — 20/20 clean under -race
  • go test -race -count=1 ./internal/agent/runtime/lifecycle/... ./internal/agent/runtime/agentctl/...
  • go vet ./...
  • go build ./...
  • CI Run Backend Tests passes on first attempt

🤖 Generated with Claude Code

@claude

claude Bot commented Jun 1, 2026

Claude finished @jcfs's task in 6m 43s


Code Review

Todo:

  • Gather context and understand the PR
  • Review streams.go (lifecycle StreamManager drain barrier)
  • Review client.go + workspace_stream.go (agentctl Client.Close drain)
  • Review agent.go and manager_interaction.go
  • Review client_close_test.go (regression test)
  • Review Makefile changes
  • Post findings

Overall Assessment

Solid, well-reasoned fix for the CI goleak flake. The two-pronged approach — closing waitCh in sm.Wait() so the SM goroutines drain without depending on external stopCh, and adding streamWG + closed guard to Client.Close() so it becomes a true drain barrier — directly addresses the two independent race windows described in the PR. The stopChannelContext rewrite is careful and correct. Most findings below are documentation/minor correctness notes; there are no blocking issues.


Findings

client.go — streamWG comment misrepresents workspace goroutine tracking
Inline comment — streamWG only tracks readUpdatesStream; workspace goroutines are on stream.wg and drained via explicit ws.Wait(), not "transitively" through streamWG.

client.go — double-close path in Client.Close()
Inline comment — CloseWorkspaceStream() closes the raw *websocket.Conn outside closeOnce, then ws.Close() tries to close it again (logged at Debug). Harmless in practice, but the comment "idempotent via closeOnce" is slightly misleading for this call site. Worth either removing CloseWorkspaceStream() from Close() or adding a note explaining both calls serve different purposes (read-loop exit vs. write-loop wakeup).

streams.go — mergeStops goroutine always spawned
Inline comment — Since secondary (= waitCh) is never nil, Done() always spawns mergeStops, including for connectUpdatesStream's context, which returns immediately after the dial. The goroutine is bounded and correct (exits on Wait()), but the primary == nil && secondary == nil early-return in Done() is now dead code. A short note would help future readers not flag this as a leak.

workspace_stream.go — post-dial section doesn't guard concurrent StreamWorkspace callers
Inline comment — Pre-existing gap: the post-dial second-lock check guards only closed, not workspaceStreamConn != nil. Two goroutines that both pass the pre-dial check could both dial and the second would overwrite workspaceStream, orphaning the first conn's goroutines. StreamManager never calls StreamWorkspace concurrently in practice, so this isn't a regression, but adding the check here would seal it for free.

client_close_test.go — mock.connected semantics
Inline comment — m.once fires on the first connection, not both. Both client goroutines are already live before this wait (both dials returned), so the drain works correctly. A clarifying comment would prevent future readers from reading the once as "both connections are live on the server side."
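
To make the once semantics concrete, here is a small sketch. The `mockServer` type below is a hypothetical reduction of the test's mock that models only the signalling, not the WebSocket handling; field and method names are assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// mockServer models only the "connected" signal of a test server.
type mockServer struct {
	once      sync.Once
	connected chan struct{}

	mu      sync.Mutex
	accepts int
}

// accept records a connection and signals connected on the first one only.
func (m *mockServer) accept() {
	m.mu.Lock()
	m.accepts++
	m.mu.Unlock()
	m.once.Do(func() { close(m.connected) })
}

func (m *mockServer) acceptCount() int {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.accepts
}

func main() {
	m := &mockServer{connected: make(chan struct{})}
	m.accept()
	<-m.connected                // unblocks after the first accept
	fmt.Println(m.acceptCount()) // prints 1: a second connection is not implied
	m.accept()                   // later accepts neither re-signal nor panic
	fmt.Println(m.acceptCount()) // prints 2
}
```

A test that needs every connection to be live on the server side would have to count accepts (or use one channel per connection) rather than rely on this signal.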


Minor observations (not inline)

  • manager_interaction.go: the switch from execution.agentctl.Close() to CloseUpdatesStream() + CloseWorkspaceStream() is correct and the comment explains the reasoning well.
  • Makefile: test-lifecycle-goleak target is a useful addition. The CGO_PREFIX variable on the go test line is consistent with the rest of the file.
  • The streamContext always returns a *stopChannelContext now (even for test callers that previously got a plain context). This is intentional — secondary (waitCh) provides the drain signal — but note that every dial now spawns a mergeStops goroutine until sm.Wait().

@coderabbitai

coderabbitai Bot commented Jun 1, 2026

Walkthrough

This PR hardens the agent runtime's shutdown lifecycle by adding drain barriers at the Client and StreamManager levels, protecting stream establishment from dial-time races, and validating with CI stress tests and a regression test to prevent goroutine leaks.

Changes

Goroutine Lifecycle Hardening and Drain Barriers

Layer / File(s) Summary
CI test infrastructure for goroutine leak detection
apps/backend/Makefile
New test-lifecycle-goleak Make target runs lifecycle tests with -race flag repeated 20 times (configurable via LIFECYCLE_GOLEAK_COUNT) with a 600s timeout to stress-test for goroutine leaks.
Client drain-barrier structures and Close() coordination
apps/backend/internal/agent/runtime/agentctl/client.go
Client struct gains workspaceStream pointer and closed boolean guard. Close() is refactored from simple sequential shutdown into a coordinated barrier: it marks closed, snapshots the workspace stream, closes both streams, then waits on the workspace stream to ensure all goroutines fully exit before returning.
StreamWorkspace hardening with dial-race detection
apps/backend/internal/agent/runtime/agentctl/workspace_stream.go
StreamWorkspace checks c.closed before dial, constructs WorkspaceStream after dial, re-checks c.closed under lock to close any newly-created connection if the client closed during dial, and identity-guards deferred cleanup to clear the workspaceStream reference only when it matches the specific stream being read.
Regression test for Client.Close() workspace stream draining
apps/backend/internal/agent/runtime/agentctl/client_close_test.go
Adds closeBarrierMockServer (holds WebSocket connections open), test client helper, and TestClientClose_DrainsWorkspaceStream which verifies Client.Close() completes within timeout and subsequent StreamWorkspace calls fail immediately post-close.
StreamManager internal drain barrier and stop-channel merging
apps/backend/internal/agent/runtime/lifecycle/streams.go
StreamManager adds internal waitCh and waitChOnce drain barrier. Creates stopChannelContext to merge external stopCh with internal waitCh via sync.Once-guarded merged channel. Updates Wait() to close waitCh before waiting on goroutine group, and refactors sleepOrStop() and connectWorkspaceStream to always select on waitCh closure for deterministic draining.
Restart path selective stream closure
apps/backend/internal/agent/runtime/lifecycle/manager_interaction.go
RestartAgentProcess replaces broad agentctl.Close() with explicit per-stream closure via CloseUpdatesStream() and CloseWorkspaceStream() to target only stream shutdown without interfering with subsequent stream calls later in the restart flow.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • kdlbs/kandev#1139: The changes to apps/backend/internal/agent/runtime/lifecycle/streams.go (adding internal waitCh drain barrier and updating sleepOrStop/connectWorkspaceStream) directly build on earlier stop-channel and leak-drain wiring in the same file.

Poem

🐰 Goroutines once leaked at shutdown's door,
But drain barriers now hold them ever more,
With dials locked tight and contexts that merge,
The client and manager together converge.
No more races, no leaks—just clean exits pure,
This lifecycle hardening will long endure!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name | Status | Explanation
Title check | ✅ Passed | The title clearly and specifically identifies the main change: fixing a goleak issue by draining StreamManager goroutines in tests.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Description check | ✅ Passed | PR description covers the problem (CI-only goleak flake), explains significant architectural changes to Client.Close() and StreamManager.Wait(), and lists a detailed test plan with validation steps.

Comment thread apps/backend/internal/agent/runtime/agentctl/client.go Outdated
Comment thread apps/backend/internal/agent/runtime/agentctl/client.go Outdated
Comment thread apps/backend/internal/agent/runtime/lifecycle/streams.go

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
apps/backend/internal/agent/runtime/lifecycle/streams.go (1)

432-449: 💤 Low value

Consider consolidating the duplicate select branches.

The shutdown select logic duplicates the ws.Done() and waitCh cases. Since stopCh can be selected on even when nil (a receive on nil channel blocks forever), you could simplify to a single select block:

select {
case <-ws.Done():
case <-sm.stopCh:
    shutdown()
case <-sm.waitCh:
    shutdown()
}

A receive on a nil channel never proceeds, so when stopCh is nil, only ws.Done() and waitCh are effective. This removes the conditional branching.

That said, the current explicit nil-check approach is clearer about intent and avoids relying on nil-channel semantics, so this is purely stylistic.
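
The nil-channel behaviour this nitpick relies on can be checked with a few lines. In this sketch `waitForStop` is a hypothetical helper, not a function from the PR; it only demonstrates that a nil channel case in a select is never chosen.

```go
package main

import "fmt"

// waitForStop reports which signal ended the wait. Any argument may be nil.
func waitForStop(done, stopCh, waitCh <-chan struct{}) string {
	select {
	case <-done:
		return "done"
	case <-stopCh: // never selected when stopCh is nil
		return "stop"
	case <-waitCh:
		return "wait"
	}
}

func main() {
	waitCh := make(chan struct{})
	close(waitCh)
	// done and stopCh are nil, so only waitCh can make progress.
	fmt.Println(waitForStop(nil, nil, waitCh)) // prints "wait"
}
```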

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/backend/internal/agent/runtime/lifecycle/streams.go` around lines 432 -
449, The duplicate select branches can be consolidated: remove the if sm.stopCh
== nil conditional and replace both branches with one select that listens for
<-ws.Done(), <-sm.stopCh, and <-sm.waitCh, calling the existing shutdown()
function in the latter two cases; relying on Go's nil-channel semantics (receive
on nil blocks) will make the <-sm.stopCh case inert when sm.stopCh is nil while
preserving behavior for ws.Done() and sm.waitCh.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 82a94bf9-322b-4b5b-80e6-124e0f5ada93

📥 Commits

Reviewing files that changed from the base of the PR and between 1fdeda9 and 0bcdc06.

📒 Files selected for processing (7)
  • apps/backend/Makefile
  • apps/backend/internal/agent/runtime/agentctl/agent.go
  • apps/backend/internal/agent/runtime/agentctl/client.go
  • apps/backend/internal/agent/runtime/agentctl/client_close_test.go
  • apps/backend/internal/agent/runtime/agentctl/workspace_stream.go
  • apps/backend/internal/agent/runtime/lifecycle/manager_interaction.go
  • apps/backend/internal/agent/runtime/lifecycle/streams.go

Comment thread apps/backend/internal/agent/runtime/agentctl/workspace_stream.go
Comment thread apps/backend/internal/agent/runtime/agentctl/client_close_test.go
@greptile-apps

greptile-apps Bot commented Jun 1, 2026

Greptile Summary

This PR fixes a CI-only goleak flake where StreamManager.connectWorkspaceStream and WorkspaceStream.writeLoop/read-loop goroutines outlived goleak.VerifyTestMain on slow runners. It makes both StreamManager.Wait() and Client.Close() true drain barriers by adding an internal waitCh to the stream manager and a closed flag plus workspaceStream pointer on the client.

  • StreamManager: adds a waitCh (closed by Wait()) that the retry-backoff (sleepOrStop), the connected <-ws.Done() select, and a new stopChannelContext mergeStops goroutine all observe, so Wait() drains goroutines even when the caller never closes the external stopCh. The mergeStops goroutine is now registered on sm.wg to handle the connectUpdatesStream path where the outer goroutine returns before the merge goroutine exits.
  • Client.Close(): now sets closed = true (blocking future StreamWorkspace dials), captures the active workspaceStream, calls both low-level close helpers to unblock the loops, then calls ws.Close() + ws.Wait() to synchronously drain the write/read goroutines. RestartAgentProcess switches to the per-stream helpers so it can keep reusing the same client after teardown.
  • Test & Makefile: adds TestClientClose_DrainsWorkspaceStream as a regression test and a make test-lifecycle-goleak target for stress-reproducing the leak under -race.

Confidence Score: 5/5

Safe to merge — the drain barriers are correctly implemented and the identity guards prevent state corruption during the restart path.

The changes are carefully scoped: waitCh is protected by sync.Once, merge goroutines are correctly registered on sm.wg before they are spawned (inside once.Do while the outer goroutine still holds its own wg count), and the two-phase guard in StreamWorkspace (pre-dial + post-dial closed check) closes the race window without introducing any new lock ordering issues. Identity guards in readWorkspaceStream's defer correctly prevent the old stream from overwriting new-stream pointers during restart. The test directly exercises the drain guarantee and the post-close error path.

No files require special attention — all changes are self-consistent and the implementation matches the documented invariants.

Important Files Changed

Filename | Overview
apps/backend/internal/agent/runtime/lifecycle/streams.go | Major rework: adds waitCh/waitChOnce to StreamManager; rewrites stopChannelContext to merge two stop channels via sync.Once-guarded goroutine tracked on sm.wg; sleepOrStop and the workspace-stream connected-select both observe waitCh so Wait() is an absolute drain barrier regardless of stopCh
apps/backend/internal/agent/runtime/agentctl/client.go | Adds closed flag and workspaceStream pointer; Close() now captures the stream under lock, calls low-level close helpers, then synchronously drains read/write goroutines via ws.Close() + ws.Wait()
apps/backend/internal/agent/runtime/agentctl/workspace_stream.go | Adds pre-dial and post-dial closed guard in StreamWorkspace, stores stream on client under lock, and extends readWorkspaceStream's defer with an identity guard for c.workspaceStream matching the existing c.workspaceStreamConn guard
apps/backend/internal/agent/runtime/lifecycle/manager_interaction.go | Switches RestartAgentProcess from client.Close() to CloseUpdatesStream() + CloseWorkspaceStream() so the client stays reusable; comment correctly explains why the terminal drain barrier cannot be used here
apps/backend/internal/agent/runtime/agentctl/client_close_test.go | New regression test validates workspace stream goroutines drain before Close() returns and that subsequent StreamWorkspace calls error; mock server correctly simulates long-lived WS handlers
apps/backend/Makefile | Adds test-lifecycle-goleak target with configurable LIFECYCLE_GOLEAK_COUNT for stress-reproducing the CI leak under -race

Reviews (3): Last reviewed commit: "fix(lifecycle): drop agent-stream drain ..."

Comment thread apps/backend/internal/agent/runtime/lifecycle/streams.go
Comment thread apps/backend/internal/agent/runtime/agentctl/client_close_test.go Outdated

@cubic-dev-ai cubic-dev-ai Bot left a comment

4 issues found across 7 files

Comment thread apps/backend/internal/agent/runtime/agentctl/workspace_stream.go
Comment thread apps/backend/internal/agent/runtime/lifecycle/streams.go
Comment thread apps/backend/internal/agent/runtime/agentctl/workspace_stream.go
Comment thread apps/backend/internal/agent/runtime/agentctl/client.go Outdated
jcfs and others added 2 commits June 3, 2026 09:19
Goroutines spawned by `agentctl.Client.StreamUpdates` /
`StreamWorkspace` and by `lifecycle.StreamManager.connectWorkspaceStream`
could outlive the tests that created them on slow CI runners, causing
intermittent `goleak.VerifyTestMain` failures in
`internal/agent/runtime/lifecycle` even though every individual test
passed. The leak required the test to race: `client.Close()` could
return while a workspace dial was still in flight, leaving the
just-spawned WS read/write loops with nobody to drain them, and
`StreamManager.Wait()` only fired the drain when the external `stopCh`
had been closed first.

Make the close paths absolute drain barriers:

- `Client.Close()` now tracks every stream goroutine it spawns
  (`streamWG` + a per-stream `WorkspaceStream` reference) and blocks
  until all of them have exited. A `closed` flag flipped under the
  client mutex makes subsequent `StreamUpdates` / `StreamWorkspace`
  calls reject the dial, so a Close that races a dial in flight cannot
  leave behind a stranded WS connection. `manager_interaction.go`
  switches the restart path to per-stream Close helpers so it can keep
  reusing the client after teardown.
- `StreamManager.Wait()` closes a new internal `waitCh` that the retry
  backoff (`sleepOrStop`) and the connected `<-ws.Done() / <-stop>`
  select also observe. This makes `Wait` an absolute drain barrier
  regardless of whether the caller ever closes the external `stopCh`.
  `streamContext` carries both stops through `stopChannelContext` so
  in-flight WebSocket dials cancel on either signal.

Local stress (`make test-lifecycle-goleak LIFECYCLE_GOLEAK_COUNT=20`)
passes 20/20 with `-race` after the fix. The new
`TestClientClose_DrainsStreamGoroutines` covers the drain barrier and
the post-Close rejection path. Added a `test-lifecycle-goleak`
Makefile target as the repro hook for the CI-only flake.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… races

Address PR review feedback:
- stopChannelContext now registers the mergeStops goroutine on an optional
  *sync.WaitGroup so sm.wg.Wait stays a true drain barrier even for the
  connectUpdatesStream path (outer goroutine returns immediately after dial).
- StreamWorkspace re-checks workspaceStreamConn after dial to seal the
  concurrent-callers race that could orphan a conn.
- readWorkspaceStream's defer now guards the workspaceStreamConn reset by
  identity so an old read goroutine cannot zero out a newer conn pointer.
- Clarify comments on streamWG, Client.Close, and the mock.connected wait
  in the close-barrier test.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@jcfs jcfs force-pushed the feature/fix-lifecycle-goleak-xdu branch from 0bcdc06 to 7562635 Compare June 3, 2026 08:22
@claude

claude Bot commented Jun 3, 2026

Claude finished @jcfs's task in 3m 51s


Claude finished reviewing — follow-up commit 7562635

Todo

  • Load tools
  • Read updated files for review
  • Verify previous review issues are resolved
  • Check for new issues in the updated code
  • Post inline comments and final assessment

Follow-up Commit Review

The second commit (fix(lifecycle): track mergeStops goroutine and guard workspace stream races) cleanly addresses all issues raised in the first round of review. Here's the resolution status:

Issue | Status
streamWG comment misrepresenting workspace goroutine tracking | ✅ Fixed — comment now correctly scopes it to readUpdatesStream only
Double-close path in Close() / misleading comment | ✅ Fixed — comment now explains both calls serve distinct purposes (read-loop exit vs write-loop wakeup)
mergeStops goroutine not tracked by sm.wg | ✅ Fixed — stopChannelContext now carries an optional wg *sync.WaitGroup; streamContext() wires &sm.wg in so every merge goroutine registers on the manager's WaitGroup
Post-dial concurrent StreamWorkspace callers could orphan first conn | ✅ Fixed — added the workspaceStreamConn != nil re-check under lock
readWorkspaceStream defer could zero a newer workspaceStreamConn | ✅ Fixed — workspaceStreamConn reset now uses the same == conn identity guard that already existed for workspaceStream
mock.connected (sync.Once) comment overstated "both connections are live" | ✅ Fixed — comment now explicitly notes once fires for the first accept and that Close() drains both regardless

One minor residual note

The if c.primary == nil && c.secondary == nil early return in stopChannelContext.Done() (streams.go:74) is dead code now that secondary (sm.waitCh) is always non-nil. This was flagged in round 1 as worth a short note so future readers don't wonder why mergeStops "always" spawns. The logic is correct and bounded — just worth a // secondary (sm.waitCh) is always non-nil so this path is unreachable comment or a cleanup in a follow-up.

wg.Add ordering is correct

The wg.Add(1) inside once.Do runs before go c.mergeStops(), and the outer goroutine cannot call sm.wg.Done() before connectUpdatesStream returns — which is after DialContext (the site that calls ctx.Done() and triggers once.Do). So the WaitGroup counter can never reach 0 before mergeStops is registered. The fix is sound.
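
The ordering argument can be reproduced in a small sketch. Everything here is hypothetical (`tracker`, `watch`, the `exited` flag); it only demonstrates why registering the helper with `Add(1)` before `go`, while the spawning goroutine still holds its own count, keeps `Wait` from returning early.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// tracker owns a WaitGroup that must cover lazily started helper goroutines.
type tracker struct {
	wg     sync.WaitGroup
	once   sync.Once
	exited atomic.Bool
}

// watch lazily starts one helper goroutine that exits when stop closes.
func (tr *tracker) watch(stop <-chan struct{}) {
	tr.once.Do(func() {
		tr.wg.Add(1) // must precede `go`: an Add after the spawn could race Wait
		go func() {
			defer tr.wg.Done()
			<-stop
			tr.exited.Store(true)
		}()
	})
}

func main() {
	tr := &tracker{}
	stop := make(chan struct{})

	tr.wg.Add(1) // the outer goroutine's own count
	go func() {
		defer tr.wg.Done()
		tr.watch(stop) // helper is registered while the outer count is held
	}() // the outer goroutine returns right away, like a dial-and-return path

	close(stop)
	tr.wg.Wait()                  // cannot return before the helper has exited
	fmt.Println(tr.exited.Load()) // prints "true"
}
```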


Overall: All substantive issues from round 1 are resolved correctly. The fix is minimal, well-documented, and safe to merge.


@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
apps/backend/internal/agent/runtime/lifecycle/manager_interaction.go (1)

547-662: 🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift

Extract the restart phases into helpers.

RestartAgentProcess is now 116 lines in a very stateful lifecycle path, so adding more restart behavior here pushes it further past the backend lint cap and makes the teardown/re-init flow harder to audit. Please split the stream teardown, subprocess restart, and ACP re-init phases into helpers before this grows again.

As per coding guidelines, Code-quality limits enforced by .golangci.yml: Functions ≤80 lines, ≤50 statements.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/backend/internal/agent/runtime/lifecycle/manager_interaction.go` around
lines 547 - 662, Split RestartAgentProcess into three helper methods to reduce
size: extract the stream teardown into a helper (e.g., closeAgentStreams(ctx,
execution) that calls execution.agentctl.CloseUpdatesStream/CloseWorkspaceStream
and handles prevMode capture), extract the subprocess restart into a helper
(e.g., restartAgentSubprocess(ctx, execution, agentConfig) that calls
execution.agentctl.Stop, builds freshCmd/freshContinueCmd via
m.buildFreshAgentCommand, updates state via m.executionStore.WithLock, waits for
agentctl readiness with WaitForReady, and calls m.configureAndStartAgent), and
extract ACP re-init into a helper (e.g., initACPAfterRestart(ctx, execution,
agentConfig) that resolves MCP servers with m.resolveMcpServers and calls
m.initializeACPSessionForRestart then m.reapplySessionModeAfterReset). Keep
existing logging, error handling and calls to m.updateExecutionError, and ensure
unique symbols used above are invoked so behavior remains identical.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 48c6006a-c99a-44dd-a85c-cec9191f0044

📥 Commits

Reviewing files that changed from the base of the PR and between 0bcdc06 and 7562635.

📒 Files selected for processing (7)
  • apps/backend/Makefile
  • apps/backend/internal/agent/runtime/agentctl/agent.go
  • apps/backend/internal/agent/runtime/agentctl/client.go
  • apps/backend/internal/agent/runtime/agentctl/client_close_test.go
  • apps/backend/internal/agent/runtime/agentctl/workspace_stream.go
  • apps/backend/internal/agent/runtime/lifecycle/manager_interaction.go
  • apps/backend/internal/agent/runtime/lifecycle/streams.go
🚧 Files skipped from review as they are similar to previous changes (5)
  • apps/backend/internal/agent/runtime/agentctl/client_close_test.go
  • apps/backend/internal/agent/runtime/agentctl/agent.go
  • apps/backend/internal/agent/runtime/agentctl/workspace_stream.go
  • apps/backend/Makefile
  • apps/backend/internal/agent/runtime/lifecycle/streams.go

…ascade

The previous Close drain barrier for the agent (updates) stream blocked
the on_turn_complete cascade flow. completeAndStopSession calls Close
synchronously on the current session's client, and Close was waiting on
streamWG (which tracked the read goroutine). The wait stalled the
cascade long enough that Step2's auto-started session never got its
profile name into the tab — surfacing as a flaky "Profile B tab not
visible" failure in workflow-agent-switch E2E shard 7.

Drain barrier is still in place for the workspace stream (where the
original goleak flake lived). Agent stream is closed but not awaited:
the cascade legitimately stops + restarts the agent stream on the same
client, and the workspace stream's existing barrier is sufficient for
the leak coverage.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@claude

claude Bot commented Jun 3, 2026

Claude finished @jcfs's task in 5m 9s


Code Review

Todo:

  • Load tools
  • Read changed files in current state (all 3 commits)
  • Review streams.go (lifecycle StreamManager drain barrier)
  • Review client.go + workspace_stream.go + agent.go (agentctl drain)
  • Review manager_interaction.go and Makefile
  • Review client_close_test.go
  • Check for issues raised in prior reviews
  • Post findings

Overall Assessment

Three-commit arc tells a coherent story. Commit 1 added drain barriers for both streams + workspace stream race guards. Commit 2 addressed all round-1 review findings (mergeStops sm.wg tracking, workspaceStreamConn/workspaceStream identity guards, comment accuracy, concurrent StreamWorkspace post-dial check). Commit 3 correctly identified a cascade deadlock introduced by the agent-stream synchronous drain and scoped the fix down to workspace-only — well-documented and sound.

The workspace drain barrier — the original goleak source — is solid. All round-1 and round-2 review issues are resolved. Two residual items noted below.


Findings

1. readUpdatesStream defer lacks identity guard on agentStreamConn (inline)

agent.go has no net diff in this PR (commit 1 added changes; commit 3 reverted them), so the comment lands on client.go:695 instead. readWorkspaceStream's defer was fixed in commit 2 to guard both pointer resets with if c.workspaceStreamConn == conn. The analogous guard is absent in readUpdatesStream's defer, which unconditionally writes c.agentStreamConn = nil. Since commit 3 removes the synchronous agent-stream drain, the race window for the restart path is now wider: the old goroutine can outlive both CloseUpdatesStream() and the StreamUpdates dial in initializeACPSessionForRestart, clobbering the new conn pointer. One-line fix: if c.agentStreamConn == conn { ... }. Greptile flagged this as P1; it was acknowledged but didn't make it into the follow-up commit.
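
A minimal sketch of the identity guard follows. The `client`, `conn`, and `readLoop` names are hypothetical stand-ins (the real field is `agentStreamConn` and the real loop is `readUpdatesStream`); the `release` and `done` channels exist only so the example can control when the old loop exits.

```go
package main

import (
	"fmt"
	"sync"
)

type conn struct{ id int }

type client struct {
	mu     sync.Mutex
	stream *conn
}

// readLoop stands in for a stream read goroutine that clears its pointer on exit.
func (c *client) readLoop(own *conn, release <-chan struct{}, done chan<- struct{}) {
	defer close(done) // runs last: signals that the cleanup below has finished
	defer func() {
		c.mu.Lock()
		defer c.mu.Unlock()
		if c.stream == own { // identity guard: only clear our own pointer
			c.stream = nil
		}
	}()
	<-release
}

func main() {
	c := &client{}
	old := &conn{id: 1}
	c.stream = old
	release, done := make(chan struct{}), make(chan struct{})
	go c.readLoop(old, release, done)

	// Restart path: a new stream is installed before the old loop has exited.
	c.mu.Lock()
	c.stream = &conn{id: 2}
	c.mu.Unlock()

	close(release)
	<-done
	fmt.Println(c.stream.id) // prints 2: the newer pointer survives
}
```

Without the `c.stream == own` comparison, the old loop's deferred cleanup would set the field to nil and the final line would dereference a nil pointer.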

2. Dead code guard in stopChannelContext.Done() (inline)

streams.go:74 — the c.primary == nil && c.secondary == nil early-return is unreachable because streamContext() always wires sm.waitCh (non-nil) as secondary. Worth a comment noting this is defensive or removing the dead branch. Minor.


What's clean

  • StreamWorkspace post-dial closed + workspaceStreamConn != nil double-check (commit 2) correctly seals both the shutdown race and the concurrent-caller race.
  • stopChannelContext.wg threading into sm.wg ensures mergeStops is tracked even when connectUpdatesStream's outer goroutine exits first — that was the subtler of the two goleak paths.
  • RestartAgentProcess comment explaining the per-stream Close vs. full client.Close() trade-off is clear and correct.
  • Test updated in commit 3 to match the narrowed contract (workspace drain only) — no dead assertions.
  • test-lifecycle-goleak Makefile target is a useful stress hook.
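
The drain-barrier and post-dial double-check pattern praised above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `client` and `stream` types, the error values, and the in-memory "dial" are all hypothetical stand-ins, and only the names `closed`, `StreamWorkspace`, and `Close` are borrowed from the review.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var (
	errClosed    = errors.New("client closed")
	errConnected = errors.New("workspace stream already connected")
)

// stream is a hypothetical stand-in for a WebSocket-backed workspace stream.
type stream struct {
	stop chan struct{}
	once sync.Once
}

func (s *stream) close() { s.once.Do(func() { close(s.stop) }) }

// client is illustrative only; its field names are assumptions.
type client struct {
	mu     sync.Mutex
	closed bool
	ws     *stream
	wg     sync.WaitGroup // counts every stream goroutine the client spawns
}

// StreamWorkspace "dials" outside the lock, then re-checks state before
// publishing the stream, so a Close or a second caller that won the race
// cannot leave a fresh stream untracked.
func (c *client) StreamWorkspace() error {
	c.mu.Lock()
	closed := c.closed
	c.mu.Unlock()
	if closed {
		return errClosed
	}

	s := &stream{stop: make(chan struct{})} // stands in for the slow dial

	c.mu.Lock()
	defer c.mu.Unlock()
	if c.closed {
		s.close()
		return errClosed
	}
	if c.ws != nil {
		s.close()
		return errConnected
	}
	c.ws = s
	c.wg.Add(1) // Add runs under mu, so it cannot race with Close's Wait
	go func() {
		defer c.wg.Done()
		<-s.stop // stands in for the read loop
	}()
	return nil
}

// Close rejects new dials, stops the active stream, and waits for its goroutine.
func (c *client) Close() {
	c.mu.Lock()
	c.closed = true
	s := c.ws
	c.ws = nil
	c.mu.Unlock()
	if s != nil {
		s.close()
	}
	c.wg.Wait()
}

func main() {
	c := &client{}
	fmt.Println("first dial:", c.StreamWorkspace())
	fmt.Println("second dial:", c.StreamWorkspace())
	c.Close()
	fmt.Println("dial after Close:", c.StreamWorkspace())
}
```

Doing the `wg.Add(1)` while holding the same mutex that guards `closed` is what lets `Close` treat `wg.Wait()` as a hard barrier: any goroutine that was counted was registered before `closed` flipped, and nothing can be registered afterwards.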

-func (c stopChannelContext) Done() <-chan struct{} {
-	if c.stopCh == nil {
+func (c *stopChannelContext) Done() <-chan struct{} {
+	if c.primary == nil && c.secondary == nil {

Dead code: this early-return is unreachable via streamContext()

streamContext() always wires sm.waitCh as secondary, and sm.waitCh is initialised unconditionally in NewStreamManager — so secondary is never nil for any context created in production or in tests that go through ConnectAll/ConnectWorkspaceStream.

The c.primary == nil && c.secondary == nil branch therefore can't fire in practice. A future reader will likely wonder whether mergeStops can be skipped or whether there's a nil-panic risk, so a short comment here would head off the confusion:

// secondary (sm.waitCh) is always non-nil, so this path is unreachable
// via streamContext(); kept as a defensive guard for external callers.
if c.primary == nil && c.secondary == nil {
    return c.Context.Done()
}

Alternatively, remove the guard and let the once.Do path always run — the merged channel has negligible overhead and removing dead code is cleaner. Either way, worth a note.
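
For readers unfamiliar with the shape being discussed, here is a rough sketch of a context whose Done channel also closes when either of two stop channels closes. The field names `primary`, `secondary`, and `wg` follow the review; the `merged` field, the `newStopContext` constructor, and the `closesOn` helper are assumptions made for illustration, and the sketch does not override `Err()`, which a fuller version would likely need.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// stopChannelContext is a hypothetical reconstruction, not the PR's code.
type stopChannelContext struct {
	context.Context
	primary   <-chan struct{} // caller-owned stop channel (may be nil)
	secondary <-chan struct{} // manager-owned drain channel, e.g. sm.waitCh
	wg        *sync.WaitGroup // tracks the merge goroutine so a drain can wait for it
	once      sync.Once
	merged    chan struct{}
}

func newStopContext(parent context.Context, primary, secondary <-chan struct{}, wg *sync.WaitGroup) *stopChannelContext {
	return &stopChannelContext{Context: parent, primary: primary, secondary: secondary, wg: wg}
}

// Done returns a channel that closes when the parent context is done or when
// either stop channel closes.
func (c *stopChannelContext) Done() <-chan struct{} {
	// Defensive guard: with no stop channels there is nothing to merge.
	if c.primary == nil && c.secondary == nil {
		return c.Context.Done()
	}
	c.once.Do(func() {
		c.merged = make(chan struct{})
		c.wg.Add(1)
		go func() {
			defer c.wg.Done()
			defer close(c.merged)
			// Receiving from a nil channel blocks forever, so a nil
			// primary or a never-cancelled parent is simply ignored.
			select {
			case <-c.Context.Done():
			case <-c.primary:
			case <-c.secondary:
			}
		}()
	})
	return c.merged
}

// closesOn reports whether Done stays open until the chosen stop channel is
// closed and then closes, with the merge goroutine fully drained.
func closesOn(stopPrimary bool) bool {
	var wg sync.WaitGroup
	primary, secondary := make(chan struct{}), make(chan struct{})
	ctx := newStopContext(context.Background(), primary, secondary, &wg)
	done := ctx.Done()
	select {
	case <-done:
		return false // must not be closed before any stop fires
	default:
	}
	if stopPrimary {
		close(primary)
	} else {
		close(secondary)
	}
	<-done
	wg.Wait()
	return true
}

func main() {
	fmt.Println("primary stop closes Done:", closesOn(true))
	fmt.Println("secondary stop closes Done:", closesOn(false))
}
```

In this sketch the merge goroutine only exits once one of the three signals fires, which is why registering it on the shared WaitGroup matters: a drain that closes the secondary channel and then waits is guaranteed to see it finish.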

ws.Close()
ws.Wait()
}

agentStreamConn identity guard still missing in readUpdatesStream's defer

This note applies to readUpdatesStream's defer in agent.go (which has no net diff in this PR and can't be commented on directly).

The readWorkspaceStream defer was fixed in commit 2 to guard both workspaceStreamConn and workspaceStream resets with identity checks (if c.workspaceStreamConn == conn). The analogous guard is still absent in readUpdatesStream's defer, which unconditionally writes c.agentStreamConn = nil.

Commit 3 intentionally removed the synchronous drain barrier for the agent stream — meaning the old goroutine can outlive CloseUpdatesStream() and the subsequent StreamUpdates dial in RestartAgentProcess. The race window is now:

  1. CloseUpdatesStream() → closes old conn, sets agentStreamConn = nil
  2. StreamUpdates → new dial succeeds, sets agentStreamConn = newConn
  3. Old goroutine exits → defer fires: agentStreamConn = nil overwrites newConn
  4. Next CloseUpdatesStream() sees nil, silently skips close → active conn orphaned

The fix is one extra if-guard in readUpdatesStream's defer in agent.go:

c.mu.Lock()
if c.agentStreamConn == conn {
    c.agentStreamConn = nil
}
c.mu.Unlock()

Greptile flagged this as P1 in their review. The response indicated it would be fixed but it landed only in workspace_stream.go. Low-risk follow-up, but should be tracked before the next restart-path regression.
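
The four-step interleaving above can be replayed deterministically in a small sketch. Everything here is hypothetical scaffolding (the `conn` and `client` types, the two `release*` helpers, and the `restart` driver); only the field name `agentStreamConn` and the guard itself come from the review.

```go
package main

import (
	"fmt"
	"sync"
)

// conn stands in for a WebSocket connection (hypothetical type).
type conn struct{ id int }

type client struct {
	mu              sync.Mutex
	agentStreamConn *conn
}

// releaseUnguarded mirrors the unconditional reset in the read loop's defer.
func (c *client) releaseUnguarded(_ *conn) {
	c.mu.Lock()
	c.agentStreamConn = nil
	c.mu.Unlock()
}

// releaseGuarded clears the pointer only if it still refers to the connection
// that the exiting goroutine owned.
func (c *client) releaseGuarded(own *conn) {
	c.mu.Lock()
	if c.agentStreamConn == own {
		c.agentStreamConn = nil
	}
	c.mu.Unlock()
}

// restart replays the interleaving and returns the pointer stored afterwards.
func restart(release func(c *client, own *conn)) *conn {
	old := &conn{id: 1}
	c := &client{agentStreamConn: old}

	// 1. CloseUpdatesStream: close the old conn and clear the pointer.
	c.mu.Lock()
	c.agentStreamConn = nil
	c.mu.Unlock()

	// 2. StreamUpdates: a new dial succeeds and is published.
	fresh := &conn{id: 2}
	c.mu.Lock()
	c.agentStreamConn = fresh
	c.mu.Unlock()

	// 3. The old read goroutine exits late and runs its cleanup.
	exited := make(chan struct{})
	go func() {
		defer close(exited)
		release(c, old)
	}()
	<-exited

	// 4. Whatever is stored now is what the next CloseUpdatesStream would see.
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.agentStreamConn
}

func main() {
	fmt.Println("unguarded keeps new conn:", restart((*client).releaseUnguarded) != nil) // false
	fmt.Println("guarded keeps new conn:", restart((*client).releaseGuarded) != nil)     // true
}
```

With the unguarded cleanup the pointer ends up nil even though the second connection is still live, which matches the "active conn orphaned" outcome in step 4; the guarded variant leaves the new pointer in place.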

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
apps/backend/internal/agent/runtime/agentctl/client_close_test.go (1)

79-126: ⚠️ Potential issue | 🟠 Major

Add per-package goleak.VerifyTestMain(m) for agentctl tests.

apps/backend/internal/agent/runtime/agentctl/ has no package-level TestMain calling goleak.VerifyTestMain(m); the only goleak.VerifyTestMain occurrence is in a comment in client_close_test.go. Add a per-package TestMain that invokes goleak.VerifyTestMain(m) (suppress unavoidable third-party goroutines as needed).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@apps/backend/internal/agent/runtime/agentctl/client_close_test.go` around
lines 79 - 126, Add a package-level TestMain that calls goleak.VerifyTestMain(m)
for the agentctl tests to detect goroutine leaks (e.g. alongside the existing
TestClientClose_DrainsWorkspaceStream test); implement TestMain(m *testing.M) to
call goleak.VerifyTestMain(m), which runs the tests and exits the process
itself, so no separate os.Exit(m.Run()) is needed, and pass
goleak.IgnoreTopFunction options to suppress known unavoidable third-party
goroutines (adjust the ignore list as necessary) so legitimate external
goroutines don't cause failures.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 267a7a64-9f3b-4d93-9279-5f95db886efa

📥 Commits

Reviewing files that changed from the base of the PR and between 7562635 and 994df5b.

📒 Files selected for processing (2)
  • apps/backend/internal/agent/runtime/agentctl/client.go
  • apps/backend/internal/agent/runtime/agentctl/client_close_test.go

@cubic-dev-ai cubic-dev-ai Bot left a comment

1 issue found across 3 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="apps/backend/internal/agent/runtime/agentctl/client.go">

<violation number="1" location="apps/backend/internal/agent/runtime/agentctl/client.go:49">
P1: Agent stream restart can clobber live connection pointer. Old read goroutine sets `agentStreamConn=nil` after new stream starts. Guard cleanup by connection identity.</violation>
</file>

// closed flips to true on Client.Close and prevents new StreamWorkspace
// dials from leaking goroutines past the close barrier. Agent (updates)
// stream is not gated on this flag because the cascade flow legitimately
// stops + restarts the agent stream on the same client; gating it would

P1: Agent stream restart can clobber live connection pointer. Old read goroutine sets agentStreamConn=nil after new stream starts. Guard cleanup by connection identity.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At apps/backend/internal/agent/runtime/agentctl/client.go, line 49:

<comment>Agent stream restart can clobber live connection pointer. Old read goroutine sets `agentStreamConn=nil` after new stream starts. Guard cleanup by connection identity.</comment>

<file context>
@@ -43,19 +43,17 @@ type Client struct {
+	// closed flips to true on Client.Close and prevents new StreamWorkspace
+	// dials from leaking goroutines past the close barrier. Agent (updates)
+	// stream is not gated on this flag because the cascade flow legitimately
+	// stops + restarts the agent stream on the same client; gating it would
+	// strand workflow step transitions on a closed client.
 	closed bool
</file context>

@jcfs jcfs merged commit 9aaf317 into main Jun 3, 2026
50 checks passed
@jcfs jcfs deleted the feature/fix-lifecycle-goleak-xdu branch June 3, 2026 10:06